Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. It is widely used in various fields such as economics, biology, engineering, and social sciences.
There are several types of linear regression, depending on the number of variables and the nature of the relationship:
Simple linear regression involves one independent variable and one dependent variable. The model is expressed as:
\[ Y = \beta_0 + \beta_1 X + \epsilon \]
where \(Y\) is the dependent variable, \(X\) is the independent variable, \(\beta_0\) is the intercept, \(\beta_1\) is the slope coefficient, and \(\epsilon\) is the random error term.
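As a minimal illustration (the simulated data and variable names are ours), a simple linear regression can be fit in R with `lm()`:

```r
# Simulate data from the model Y = beta0 + beta1 * X + eps
set.seed(1)
X <- runif(50, 0, 1)
Y <- 2 + 3 * X + rnorm(50, sd = 0.5)

# Fit the simple linear regression and inspect the estimates
fit <- lm(Y ~ X)
coef(fit)  # estimates of beta0 (intercept) and beta1 (slope)
```

The fitted coefficients should be close to the true values 2 and 3 used in the simulation.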
Multiple linear regression includes two or more independent variables:
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \gamma_1 Z_1 + \gamma_2 Z_2 + \dots + \gamma_q Z_q + \epsilon \]
where \(Y\) is the dependent variable, \(X_1, \dots, X_p\) and \(Z_1, \dots, Z_q\) are the independent variables, \(\beta_0\) is the intercept, \(\beta_j\) and \(\gamma_k\) are the corresponding regression coefficients, and \(\epsilon\) is the error term.
A common estimation method for multiple linear regression is Ordinary Least Squares (OLS), which estimates the parameters (coefficients) of the model. OLS finds the line that minimizes the sum of the squared differences between the observed values and the predicted values.
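The OLS solution has the closed form \(\hat{\beta} = (X^TX)^{-1}X^Ty\). A base-R sketch with simulated data (our own example) shows it agrees with `lm()`:

```r
set.seed(42)
n  <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)
y  <- 1 + 2 * X1 - 0.5 * X2 + rnorm(n)

# Design matrix with an intercept column
X <- cbind(1, X1, X2)

# Closed-form OLS estimate: beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Matches lm() up to numerical precision
all.equal(as.numeric(beta_hat), as.numeric(coef(lm(y ~ X1 + X2))))
```

In practice `lm()` uses a QR decomposition rather than inverting \(X^TX\), which is numerically more stable, but the estimates coincide.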
Ridge regression is a type of regularized regression that adds a penalty term to the loss function:
\[ \text{Loss} = \sum (y_i - \hat{y}_i)^2 + \lambda \sum \beta_j^2 \]
This helps reduce overfitting in the presence of multicollinearity.
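Unlike the lasso below, ridge keeps a closed form, \(\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty\). A simplified base-R sketch (our own example; for simplicity the intercept is penalized too, whereas in practice it is usually left unpenalized and predictors are standardized):

```r
set.seed(7)
n  <- 100
X1 <- rnorm(n)
X2 <- X1 + rnorm(n, sd = 0.01)   # nearly collinear predictors
y  <- 1 + 2 * X1 + rnorm(n)

X      <- cbind(1, X1, X2)
lambda <- 1

# Ridge estimate: (X'X + lambda * I)^{-1} X'y
beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)

# The penalty keeps the estimates finite and stable
# even though X'X is nearly singular
beta_ridge
```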
Lasso (Least Absolute Shrinkage and Selection Operator) also adds a penalty term but uses the absolute value of coefficients:
\[ \text{Loss} = \sum (y_i - \hat{y}_i)^2 + \lambda \sum |\beta_j| \]
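The lasso has no closed-form solution in general; coordinate-descent solvers build on the soft-thresholding operator \(S_\lambda(z) = \operatorname{sign}(z)\max(|z| - \lambda, 0)\). A minimal sketch (function name is ours):

```r
# Soft-thresholding operator used in lasso coordinate descent
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

soft_threshold(c(-3, -0.5, 0, 0.5, 3), lambda = 1)
# yields -2, 0, 0, 0, 2: small coefficients are shrunk exactly
# to zero, which is how the lasso performs variable selection
```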
To compute a p-value for each coefficient in a linear regression model, we first form the t-statistic
\[ t_i = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)} \]
which compares each estimate to its standard error. Under the null hypothesis \(\beta_i = 0\), \(t_i\) follows a Student's t distribution with the residual degrees of freedom. OLS is the natural framework for computing these p-values because, under the classical assumptions, the sampling distributions of its estimators are known exactly. The core statistics involved are the coefficient estimates \(\hat{\beta}_i\), their standard errors \(SE(\hat{\beta}_i)\), the resulting t-statistics \(t_i\), and the residual degrees of freedom.
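These p-values can be reproduced by hand from `lm()` output: divide each estimate by its standard error and compare to a t distribution with the residual degrees of freedom (a sketch with our own simulated data):

```r
set.seed(3)
n  <- 100
X1 <- rnorm(n)
y  <- 1 + 0.3 * X1 + rnorm(n)
fit <- lm(y ~ X1)

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# t-statistic and two-sided p-value on the residual df
t_stat <- est / se
p_val  <- 2 * pt(-abs(t_stat), df = fit$df.residual)

# Matches the p-values reported by summary(fit)
all.equal(unname(p_val), unname(coef(summary(fit))[, "Pr(>|t|)"]))
```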
A critical step in OLS is computing the inverse of the matrix \(X^TX\), which is needed to estimate coefficients and their standard errors. However, this matrix becomes singular (non-invertible) or nearly singular (ill-conditioned) when predictors are highly collinear, that is, when one or more predictors are linearly dependent or strongly correlated.
In such cases, the determinant of \(X^TX\) approaches zero, making \((X^TX)^{-1}\) either impossible to compute or numerically unstable, which in turn leads to unreliable and highly sensitive coefficient estimates.
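Near-singularity can be diagnosed before fitting, for example via the condition number of \(X^TX\) (base R's `kappa()`); a sketch with our own data:

```r
set.seed(9)
X1 <- runif(100)
X2 <- X1 + rnorm(100, sd = 1e-8)   # almost an exact copy of X1
X  <- cbind(1, X1, X2)

# A huge condition number means X'X is ill-conditioned,
# so (X'X)^{-1} is numerically unreliable
kappa(t(X) %*% X)
```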
Collinearity occurs when two or more predictor variables are highly correlated; in the Stoat case considered here, every regression design matrix is collinear.
# Set seed for reproducibility
set.seed(123)
# X1 can be any continuous variable
X1 <- runif(100, 0, 1)
# Make X2 perfectly collinear with X1
X2 <- X1 # now X2 = X1 exactly
# Response variable Y
Y <- 5 + 2*X1 + rnorm(100, sd = 0.1)
# Fit linear model with perfectly collinear predictors
model2 <- lm(Y ~ X1 + X2)
summary(model2)
##
## Call:
## lm(formula = Y ~ X1 + X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.223797 -0.061323 -0.001973 0.059633 0.221723
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.99910 0.01961 254.99 <2e-16 ***
## X1 1.99102 0.03418 58.25 <2e-16 ***
## X2 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09693 on 98 degrees of freedom
## Multiple R-squared: 0.9719, Adjusted R-squared: 0.9716
## F-statistic: 3393 on 1 and 98 DF, p-value: < 2.2e-16
When predictors are perfectly collinear, the model may fail to estimate some coefficients (this can happen when there are more than two columns and at least two of them are identical):
set.seed(123)
n <- 100
# First column
X1 <- runif(n, 0, 1)
# Second column is exactly the same as the first
X2 <- X1 # perfect collinearity
# Third column ensures rows sum to 1
X3 <- 1 - (X1 + X2)
# Response variable
Y <- 5 + 2*X1 + 3*X2 + rnorm(n)
# Fit linear model
df <- data.frame(Y = Y, X1 = X1, X2 = X2, X3 = X3)
model <- lm(Y ~ ., data = df)
summary(model)
##
## Call:
## lm(formula = Y ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.23797 -0.61323 -0.01973 0.59633 2.21723
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9910 0.1961 25.46 <2e-16 ***
## X1 4.9102 0.3418 14.37 <2e-16 ***
## X2 NA NA NA NA
## X3 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9693 on 98 degrees of freedom
## Multiple R-squared: 0.678, Adjusted R-squared: 0.6747
## F-statistic: 206.3 on 1 and 98 DF, p-value: < 2.2e-16
To handle this special case of collinearity, one strategy we examine in this report is to remove the last column of each predictor matrix \(X\) and to merge columns that are perfectly redundant.
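The merging step can be sketched in base R (the helper name is ours): detect predictor columns that duplicate an earlier one, keep a single representative whose coefficient absorbs the merged effect, and drop the rest before fitting:

```r
# Drop predictor columns that are exact duplicates of an earlier column
drop_duplicate_columns <- function(X) {
  X[, !duplicated(t(X)), drop = FALSE]
}

set.seed(123)
n  <- 100
X1 <- runif(n)
X2 <- X1                          # perfect duplicate of X1
X  <- cbind(X1 = X1, X2 = X2)
Y  <- 5 + 2 * X1 + 3 * X2 + rnorm(n)

Xr <- drop_duplicate_columns(X)
colnames(Xr)                      # only "X1" survives

# The reduced model is full rank; the X1 coefficient now
# estimates the combined effect (about 2 + 3 = 5)
fit <- lm(Y ~ Xr)
coef(fit)
```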
## P-value distribution [min/max/mean] :
## Method Min Mean Max
## 1 PR 0 0.0002560279 0.1141045
## % of p-value significant per method:
## P_R P_Maths P_Stoat P_RVTest
## 99.42 99.42 99.42 99.42
##
## Difference in P-values per method [min/max/mean %]:
## Maths : min = 0, mean = 0, max = 0
## Stoat : min = 0, mean = 0, max = 0
## RVTest : min = 0.634, mean = 4.3553, max = 10.6625
(Plots: pval_hist, pval_box, pval_hist_small, pval_diff_vs_R, pval_diff_hist, coef_hist, coef_diff_hist.)
## P-value distribution [min/max/mean] :
## Method Min Mean Max
## 1 PR 0 0.09983462 0.9998629
## % of p-value significant per method:
## P_R P_Maths P_Stoat P_RVTest
## 80.18 80.18 80.18 80.18
##
## Difference in P-values per method [min/max/mean %]:
## Maths : min = 0, mean = 0, max = 0
## Stoat : min = 0, mean = 0, max = 0
## RVTest : min = 0, mean = 0.2811, max = 2.5267
(Plots: pval_hist, pval_box, pval_ecdf, pval_diff_vs_R, pval_diff_hist, coef_hist, coef_diff_hist.)
## P-value distribution [min/max/mean] :
## Method Min Mean Max
## 1 PR 0 0.0002883187 0.2293766
## % of p-value significant per method:
## P_R P_Maths P_Stoat P_RVTest
## 99.40 80.00 80.04 81.68
##
## Difference in P-values per method [min/max/mean %]:
## Maths : min = 335.9643, mean = 566922183802.968, max = 203644960156122
## Stoat : min = 1.1374, mean = 566920288218.151, max = 203644960156122
## RVTest : min = 156.7742, mean = 121801303130.594, max = 82299951251064.7
(Plots: pval_hist, pval_box, pval_hist_small, pval_diff_vs_R, pval_diff_hist, coef_hist, coef_diff_hist.)
## P-value distribution [min/max/mean] :
## Method Min Mean Max
## 1 PR 0 0.1014515 0.9985768
## % of p-value significant per method:
## P_R P_Maths P_Stoat P_RVTest
## 80.16 80.00 80.00 80.00
##
## Difference in P-values per method [min/max/mean %]:
## Maths : min = 0.1425, mean = 516.7819, max = 58755.554
## Stoat : min = 0.1425, mean = 516.7819, max = 58755.5772
## RVTest : min = 0.0785, mean = 210.663, max = 16307.0963
(Plots: pval_hist, pval_box, pval_ecdf, pval_diff_vs_R, pval_diff_hist, coef_hist, coef_diff_hist.)
## P-value distribution [min/max/mean] :
## Method Min Mean Max
## 1 PR 0 0.0002073024 0.1246804
## % of p-value significant per method:
## P_R P_Maths P_Stoat P_RVTest
## 99.62 99.62 99.62 99.62
##
## Difference in P-values per method [min/max/mean %]:
## Maths : min = 0, mean = 0, max = 0
## Stoat : min = 0, mean = 0, max = 0
## RVTest : min = 0.7557, mean = 5.1593, max = 12.4004
(Plots: pval_hist, pval_box, pval_hist_small, pval_diff_vs_R, pval_diff_hist, coef_hist, coef_diff_hist.)
## P-value distribution [min/max/mean] :
## Method Min Mean Max
## 1 PR 0 0.1003566 0.999028
## % of p-value significant per method:
## P_R P_Maths P_Stoat P_RVTest
## 80.18 80.18 80.18 80.18
##
## Difference in P-values per method [min/max/mean %]:
## Maths : min = 0, mean = 0, max = 0
## Stoat : min = 0, mean = 0, max = 0
## RVTest : min = 2e-04, mean = 0.3473, max = 3.6209
(Plots: pval_hist, pval_box, pval_ecdf, pval_diff_vs_R, pval_diff_hist, coef_hist, coef_diff_hist.)